I

Introduction.

The object of this paper is two-fold: (1) to present
some empirical evidence as to the nature of frequency constants
determined from small samples; and (2) to give a short algebraic
treatment of the theoretical problem involved.

The science of statistics grew out of the fact that
data gathered from large numbers, for any given measurable
character, tend to fall into systematic order. A mathematical
description of this tendency forms the basis for the science as it
is today.

Until within the past few years, the question of the
effect of small numbers upon the accuracy of such a mathematical
description was not raised. Astronomers and biologists and
pedagogists, with a faith amounting almost to sublimity, proceeded
to apply to cases involving from a half dozen to half a hundred
numerical measurements laws which had been derived for large
numbers. There is little doubt that sometimes the nature of the
problem justified that faith; there is no doubt whatever that
often conclusions have been drawn, and heralded far and wide,
for which no scientific criterion existed.

* See Biometrika, vol. VI, "On the Probable Error of the
Correlation Coefficient", and "On the Probable Error of a Mean".

The crux of the whole problem centers in that word
"large". To the mathematician, "large" in statistics means large
enuf to reach a desired degree of approximation. That is, all the
formulae of statistics, as they are used in practice, are but
approximations which are supposed not to fluctuate beyond
deviations which might reasonably be expected from "random sampling".*

Just how large a population of variates is necessary
in order that a given formula may be applicable has not yet been
determined. It has been customary to check up by common sense,
a very commendable procedure at all times, but often the
intricacies of the situation make that impossible. The problem is one
which can be solved only by an examination of the effect which is
inherent in the formulae. Most of the formulae used were intended
to be applied to cases involving large numbers; for the sake of
definiteness let us say a thousand or more. Many investigations
involve less than a hundred variates. What effect does this have
upon the accuracy of the results obtained?

Probable error formulae have aided much in determining
the value of statistical investigations, but there still remains
the query, "Given a certain percentage of a population, what are
the chances that that sample represents the characteristics of
the population as a whole?" The discussion here presented deals
only with general tendencies. It is by no means a complete
solution of the problem.

* A discussion of "orders of magnitude" and "degrees of
approximation" may be found in the Transactions of the Cambridge
Philosophical Society, volume 20, p. 131 ff. It is in the appendix of
Professor Edgeworth's paper on "The Law of Error".
II

Empirical Evidence.

In the first place, I set about empirically to seek
evidence of the effect of drawing small samples from a large
population, choosing for this purpose a table showing correlation
between length of ear and weight of ear in Indian corn. (See p. 4).

Upon 990 bone buttons I placed the figures corresponding
to the respective weights and lengths shown in the table.
Having thoroly mixed them (the smooth surface of the buttons
facilitated this process), I drew at random from the mixture sets
of ten, in each instance mixing the remainder thoroly before
drawing the next sample. This gave 99 correlation tables, each
containing ten variates.
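The drawing procedure can be imitated numerically. The sketch below is
only an illustration under stated assumptions: the corn table is not
reproduced here, so a synthetic population of 990 correlated pairs
stands in for the buttons, and the names `make_population`,
`draw_groups` and `constants` are hypothetical. Shuffling once and
cutting the list into consecutive sets of ten is used as a stand-in
for drawing sets of ten without replacement.

```python
import math
import random

def make_population(size=990, seed=1):
    """Synthetic stand-in for the 990 (length, weight) pairs; NOT the corn data."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(size):
        x = rng.gauss(0.0, 1.0)
        y = 0.6 * x + 0.8 * rng.gauss(0.0, 1.0)  # correlated with x by construction
        pairs.append((x, y))
    return pairs

def draw_groups(population, group_size=10, seed=2):
    """Shuffle, then cut into consecutive groups (sampling without replacement)."""
    rng = random.Random(seed)
    shuffled = population[:]
    rng.shuffle(shuffled)
    return [shuffled[i:i + group_size] for i in range(0, len(shuffled), group_size)]

def constants(pairs):
    """Means, standard deviations (divisor n) and correlation coefficient."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in pairs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for _, y in pairs) / n)
    r = sum((x - mx) * (y - my) for x, y in pairs) / (n * sx * sy)
    return mx, my, sx, sy, r

population = make_population()
groups = draw_groups(population)
group_constants = [constants(g) for g in groups]
print(len(groups), "groups of", len(groups[0]))
```

Each entry of `group_constants` then plays the part of one of the 99
small correlation tables.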

I computed the means, the standard deviations, and the
correlation coefficient for each of the 99 groups. This gave a
frequency distribution of 99 for each set of constants. The
average of these I compared with the corresponding constants for the
whole population of 990. In addition, I computed the moments of

It was deemed impracticable to reproduce the complete
data for the 99 sub-groups. It might, however, be well to state
that in so far as possible the groupings were made in such a
manner that no undue weight was given to repeated digits arising from
the nature of the computation. That is, if a series of numbers
ended in 7 (say), that digit was not permitted to swing the whole
series in either direction. The size of the groupings was
determined arbitrarily, but the mark about which the lowest group in
each class was centered was chosen in such a manner that the
division point came at a decided break in the series.

To aid in the discussion of the experiment, the following
notation is introduced:
R = the correlation coefficient of the entire population.
Let $r$, $s_1$, $s_2$, $m_1$, $m_2$ represent corresponding constants
in any sub-group. Use $\bar{r}$ and $\bar{s}$ for weighted arithmetic
means of the r's and the s's.


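Comparative Results.

Entire population        Mean of the sub-groups

$R$   = 0.574            $\bar{r}$   = 0.555
$S_1$ = 0.692            $\bar{s}_1$ = 0.620
$S_2$ = 0.451            $\bar{s}_2$ = 0.410
$M_1$ = 8.656            $\bar{m}_1$ = 8.649
$M_2$ = 320.5            $\bar{m}_2$ = 320.2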

From this it would seem to follow that in the long run
standard deviations determined from small samples are too small;
that there is practically no difference in means thus obtained;
and that the correlation coefficient is slightly too low in value.

Some subsidiary considerations may be of interest.
For example, note the following comparisons on fluctuations of
the various constants, as determined from Standard Error* formulae
and as computed from the distribution actually obtained.

Constant.    By Standard Error*    By Experiment.

$M_1$        .207                  .219
$M_2$        .136                  .143
$S_1$        .163                  .155
$S_2$        .103                  .101
$R$          .265                  .212

With the exception of the standard error in R, these check
fairly closely. That is, the use of the Standard Error formulae
on the constants of the whole population gives an estimated
fluctuation which is fairly well borne out by the experimental
evidence.

The following schematic arrangement shews the nature
of the distributions with reference to Pearson's curves, where

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3}, \qquad \beta_2 = \frac{\mu_4}{\mu_2^2}, \qquad \kappa = \frac{\beta_1(\beta_2+3)^2}{4(4\beta_2-3\beta_1)(2\beta_2-3\beta_1-6)}.$$

Constant.    $\beta_1$    $\beta_2$    $\kappa$    Type.

$m_1$        .855         3.07         -0.334      I
$m_2$        .315         .125         0.259       IV
$s_1$        .000         2.09         -0.000      II(I)
$s_2$        .709         4.85         0.403       IV
$r$          2.19         6.08         -6.04       I

* Yule, G. U.; "An Introduction to the Theory of Statistics", p. 347.
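The criterion column of the table can be recomputed from the two
preceding columns. The sketch below assumes the usual form of
Pearson's criterion, kappa = beta1 (beta2 + 3)^2 divided by
4 (4 beta2 - 3 beta1)(2 beta2 - 3 beta1 - 6); the function names
`pearson_kappa` and `main_type` are hypothetical, and the type
assignment is a coarse reading that ignores the transition types.

```python
def pearson_kappa(beta1, beta2):
    """Pearson's criterion kappa computed from beta1 and beta2."""
    numerator = beta1 * (beta2 + 3.0) ** 2
    denominator = 4.0 * (4.0 * beta2 - 3.0 * beta1) * (2.0 * beta2 - 3.0 * beta1 - 6.0)
    return numerator / denominator

def main_type(kappa):
    """Coarse reading of the criterion; transition types are not handled."""
    if kappa < 0:
        return "I"
    if kappa < 1:
        return "IV"
    return "VI"

# beta values copied from two rows of the table above.
for label, b1, b2 in [("m_1", 0.855, 3.07), ("s_2", 0.709, 4.85)]:
    k = pearson_kappa(b1, b2)
    print(label, round(k, 3), main_type(k))
```

For these two rows the computed criterion agrees with the tabulated
values to within rounding in the third decimal place.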
Theoretical Treatment.

In attacking the problem from the theoretical side, the
method used was purely algebraic, and was based on the essentials
of the definitions of the constants involved.

An idea of the method can perhaps best be gained by
working out the simple case of the mean.

Let N represent the number of variates in a given
population, n the number in any sub-group, and k = N/n the number
of sub-groups. Other constants have the same significance as
heretofore (See p. 5). Use $\sum$ for a sign of summation.
By definition,

$$N M = \sum_{1}^{N} x = \sum_{1}^{n_i} x, \qquad i = 1, 2, \ldots, k.$$

The summation refers to the variable x, and i is to take values
1, 2, 3, ..., k, so that the second summation refers to k different
summations, one for each sub-group, and the x in each sub-group
takes on as many different values as there are units in
the n for that sub-group, that is, in the expanded form

$$\sum_{1}^{N} x = \sum_{1}^{n_1} x + \sum_{1}^{n_2} x + \sum_{1}^{n_3} x + \cdots + \sum_{1}^{n_k} x.$$
This is given somewhat in detail because the same notation is
used thruout, and hereafter only the abbreviated form appears.
We have then,

$$M = \frac{n_1 m_1 + n_2 m_2 + \cdots + n_k m_k}{n_1 + n_2 + \cdots + n_k},$$

after dividing thru by N, which is equal to the sum of the n's.
This gives the well-known result that the weighted arithmetic
mean of the means of the sub-groups is equal to the mean of the
entire population.
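The identity is easy to confirm on a small case. The following is a
minimal sketch with made-up values and unequal sub-group sizes; the
function name `weighted_mean_of_means` is hypothetical.

```python
def weighted_mean_of_means(groups):
    """Sum of n_i * m_i over the sub-groups, divided by the sum of the n_i."""
    total_n = sum(len(g) for g in groups)
    return sum(len(g) * (sum(g) / len(g)) for g in groups) / total_n

# Made-up values, grouped into sub-groups of unequal size.
groups = [[8.0, 9.5, 7.5], [10.0, 8.5], [9.0, 9.0, 8.0, 7.0]]
all_values = [x for g in groups for x in g]
M = sum(all_values) / len(all_values)
print(M, weighted_mean_of_means(groups))
```

The two printed numbers agree up to floating-point rounding, whatever
grouping is chosen.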


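Next consider the distribution of standard deviations.
By definition,

$$\begin{aligned}
N S^2 = \sum_{1}^{N} (x-M)^2 &= \sum_{1}^{n_i} (x-M)^2, \qquad i = 1, 2, \ldots, k \qquad (1) \\
&= \sum_{1}^{n_i} (x - m_i + m_i - M)^2 \\
&= \sum_{1}^{n_i} (x-m_i)^2 + 2\sum_{1}^{n_i} (x-m_i)(m_i-M) + \sum_{1}^{n_i} (m_i-M)^2 \\
&= \sum_{i=1}^{k} n_i s_i^2 + 0 + \sum_{i=1}^{k} n_i (m_i-M)^2. \qquad (2)
\end{aligned}$$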

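Expansion of the middle term easily shows it to vanish. The
last term may be reduced as follows:

$$\begin{aligned}
n(m-M)^2 &= \frac{1}{n}\bigl[n(m-M)\bigr]^2 = \frac{1}{n}(nm-nM)^2 = \frac{1}{n}(x_1 + x_2 + \cdots + x_n - nM)^2 \\
&= \frac{1}{n}\bigl[(x_1-M) + (x_2-M) + \cdots + (x_n-M)\bigr]^2 = \frac{1}{n}\sum_{1}^{n} (x-M)^2, \qquad (3)
\end{aligned}$$

plus cross-product terms which vanish with random sampling.
By means of (3) equation (2) becomes

$$N S^2 = n_1 s_1^2 + \cdots + n_k s_k^2 + \sum_{i=1}^{k} \frac{1}{n_i}\sum_{1}^{n_i} (x-M)^2.$$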

The last part of this contains the same summations as (1), and if
we assume the n's all equal, say to n, there is a common factor 1/n.
Divide thru by N and the result is

$$S^2 = \overline{s^2} + \frac{1}{n} S^2,$$

where as before the dash over the $s^2$ represents a weighted
arithmetic mean. We have finally,

$$\overline{s^2} = \frac{n-1}{n} S^2. \qquad (4)$$

Thus it is seen that the relation between $S^2$ and $\overline{s^2}$
is independent of the number in the original population.


On the other hand, the size of the sub-groups has a distinct
bearing. In the experiment already described, n = 10. So we would
expect the average of the squared standard deviations of the 99
groups to be nine-tenths that of the population of 990. The result
obtained was, for the standard deviations of length of ear,

$$\overline{s_1^2} = 0.435; \qquad \frac{9}{10} S_1^2 = 0.431.$$

This serves as an illustration. I have no doubt that a similar
coincidence might be expected for the distribution of weights.
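Both steps of the argument can be checked numerically. The sketch
below rests on assumptions of its own: the population is synthetic
(990 normal deviates, not the corn data), standard deviations use the
divisor n, and the helper names are hypothetical. It verifies that the
split of N S^2 into within-group and between-group parts in equation
(2) is exact for any partition, and then averages the ratio of the
mean squared sub-group deviation to S^2 over repeated random
partitions.

```python
import random

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Second moment about the mean, divisor n (the s squared of the text)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def split(values, n, rng):
    """Random partition into consecutive sub-groups of size n."""
    shuffled = values[:]
    rng.shuffle(shuffled)
    return [shuffled[i:i + n] for i in range(0, len(shuffled), n)]

rng = random.Random(0)
population = [rng.gauss(0.0, 1.0) for _ in range(990)]  # hypothetical data
N, n = len(population), 10
M, S2 = mean(population), variance(population)

groups = split(population, n, rng)
within = sum(n * variance(g) for g in groups)           # sum of n_i * s_i^2
between = sum(n * (mean(g) - M) ** 2 for g in groups)   # sum of n_i * (m_i - M)^2
identity_gap = abs(N * S2 - (within + between))         # equation (2)

# Mean of s^2 relative to S^2, averaged over 200 random partitions.
ratios = []
for _ in range(200):
    parts = split(population, n, rng)
    ratios.append(mean([variance(g) for g in parts]) / S2)
print(identity_gap, mean(ratios))
```

The first number is zero up to rounding; the second should lie close
to the nine-tenths given by formula (4) for n = 10.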

One word of caution is necessary. It is to be noted
that the above results were computed on the squares of the
constants, and that the average is an average of squares. The fact
that the standard deviations of the sub-groups are correlated*
means that in general the average of the squares is different
from the square of the average; that is, $\overline{s^2} \neq \bar{s}^2$.
So we may not infer that $\sqrt{\frac{n-1}{n}}\,S = \bar{s}$, even
tho formula (4) be true.

For example, in the case cited above,

$$\bar{s} = 0.620; \qquad \sqrt{\tfrac{9}{10}}\,S = 0.656.$$

Were it not for this fluctuation due to the correlation between
the s's of the sub-groups, formula (4) would give a very accurate
description of the influence of small numbers on standard
deviation. As it is, there is shown a decided tendency, in the long
run, for s to be too small. It is to be noted that as n increases,
the degree of approximation on the squares approaches unity. For
numbers less than ten, however, there is a marked discrepancy.
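The distinction between the average of the squares and the square of
the average can be seen on any set of unequal numbers. In the sketch
below the list `s_values` is made up for illustration, and the
function names are hypothetical; only S = 0.692 and n = 10 are taken
from the text.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def root_mean_square(values):
    """Square root of the average of the squares."""
    return math.sqrt(sum(v * v for v in values) / len(values))

# Hypothetical sub-group standard deviations (not the experimental ones).
s_values = [0.35, 0.50, 0.62, 0.71, 0.90]
print(arithmetic_mean(s_values), root_mean_square(s_values))

# Figures quoted in the text: S = 0.692 for length of ear, n = 10.
predicted_root = math.sqrt(9 / 10) * 0.692
print(predicted_root)
```

The plain average never exceeds the root of the averaged squares, so
an average of s computed directly falls below the value suggested by
formula (4) whenever the s's differ among themselves.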

* See Biometrika, vol. IX, pp. 1-10, Pearson's article entitled
"On the Probable Error of Frequency Constants".
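The last problem is that of the distribution of correlation
coefficients from small samples. For a population of N, we
have, by definition,

$$N S_x S_y R = \sum_{1}^{N} (x-M_x)(y-M_y),$$

where the summation ranges on x and y simultaneously. Proceeding
as before,

$$\begin{aligned}
N S_x S_y R &= \sum_{1}^{n_i} (x_i-M_x)(y_i-M_y), \qquad i = 1, 2, \ldots, k \qquad (5) \\
&= \sum_{1}^{n_i} (x_i - m_{x_i} + m_{x_i} - M_x)(y_i - m_{y_i} + m_{y_i} - M_y) \\
&= \sum_{1}^{n_i} (x_i-m_{x_i})(y_i-m_{y_i}) + \sum_{1}^{n_i} (x_i-m_{x_i})(m_{y_i}-M_y) \\
&\qquad + \sum_{1}^{n_i} (m_{x_i}-M_x)(y_i-m_{y_i}) + \sum_{1}^{n_i} (m_{x_i}-M_x)(m_{y_i}-M_y) \\
&= \sum_{i=1}^{k} n_i s_{x_i} s_{y_i} r_i + 0 + 0 + \sum_{i=1}^{k} n_i (m_{x_i}-M_x)(m_{y_i}-M_y). \qquad (6)
\end{aligned}$$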

By a process exactly like that shown in reducing equation (2) on
p. 8, the last term of (6) reduces to

$$\sum_{i=1}^{k} \frac{1}{n_i}\sum_{1}^{n_i} (x_i-M_x)(y_i-M_y), \qquad (7)$$

plus terms which vanish with random sampling. By means of
expression (7), equation (6) becomes

$$N S_x S_y R = n_1 s_{x_1} s_{y_1} r_1 + \cdots + n_k s_{x_k} s_{y_k} r_k + \sum_{i=1}^{k} \frac{1}{n_i}\sum_{1}^{n_i} (x_i-M_x)(y_i-M_y).$$

The last term of this contains the same summations as (5), and if
we assume the n's all equal, say to n, there is a common factor 1/n.
Divide thru by N, and the result is

$$S_x S_y R = \overline{s_x s_y r} + \frac{1}{n} S_x S_y R,$$

where as before the dash represents an arithmetic mean. This may
be written

$$\frac{n-1}{n} S_x S_y R = \frac{s_{x_1} s_{y_1} r_1 + \cdots + s_{x_k} s_{y_k} r_k}{k}. \qquad (8)$$


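From (4),

$$S_x = \sqrt{\frac{n}{n-1}}\,\sqrt{\frac{s_{x_1}^2 + \cdots + s_{x_k}^2}{k}} \qquad (9)$$

and

$$S_y = \sqrt{\frac{n}{n-1}}\,\sqrt{\frac{s_{y_1}^2 + \cdots + s_{y_k}^2}{k}}. \qquad (10)$$

Combining (8), (9) and (10), we find

$$R = \frac{s_{x_1} s_{y_1} r_1 + \cdots + s_{x_k} s_{y_k} r_k}{\sqrt{s_{x_1}^2 + \cdots + s_{x_k}^2}\,\sqrt{s_{y_1}^2 + \cdots + s_{y_k}^2}}. \qquad (11)$$

Now,

$$r' = \frac{s_{x_1} s_{y_1} r_1 + \cdots + s_{x_k} s_{y_k} r_k}{s_{x_1} s_{y_1} + \cdots + s_{x_k} s_{y_k}}. \qquad (12)$$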

(I have used the symbol r' to indicate that this is a weighted
arithmetic mean with a very special kind of weighting. The
significance of this comes out in the sequel.)

Since the expressions for R and r' are identical in
the numerators, the problem resolves itself into a discussion of
the relative value of the two denominators. For this purpose I
shall establish the following
LEMMA:

$$\sqrt{a_1^2 + \cdots + a_n^2}\;\sqrt{b_1^2 + \cdots + b_n^2} > a_1 b_1 + \cdots + a_n b_n, \qquad (13)$$

the numbers being positive, and $\frac{a_i}{b_i} \neq \frac{a_j}{b_j}$.


Squaring both members of (13), we have

$$\sum_{1}^{n} a_i^2 b_i^2 + \sum_{i \neq j} a_i^2 b_j^2 > \sum_{1}^{n} a_i^2 b_i^2 + 2\sum_{i<j} a_i a_j b_i b_j. \qquad (14)$$

Transposing the terms from the right-hand member of (14) to the
left, we have

$$\sum_{i<j} (a_i b_j - a_j b_i)^2 > 0,$$

which establishes the lemma.

Application of this lemma to equations (11) and (12) at
once establishes the relation,

$$r' > R, \qquad (15)$$

the denominators being unequal in the reverse order, and the
numerators identical.
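Relation (15) can be tried on arbitrary figures. The sketch below
evaluates the right-hand sides of (11) and (12) for made-up sub-group
constants; the function names `pooled_R` and `weighted_r` and all the
numbers are hypothetical, and the r's are kept positive so that the
common numerator is positive.

```python
import math
import random

def pooled_R(sx, sy, r):
    """Right-hand side of equation (11)."""
    numerator = sum(a * b * c for a, b, c in zip(sx, sy, r))
    denominator = math.sqrt(sum(a * a for a in sx)) * math.sqrt(sum(b * b for b in sy))
    return numerator / denominator

def weighted_r(sx, sy, r):
    """Equation (12): mean of the r's weighted by the products s_x * s_y."""
    numerator = sum(a * b * c for a, b, c in zip(sx, sy, r))
    return numerator / sum(a * b for a, b in zip(sx, sy))

rng = random.Random(3)
k = 99
sx = [rng.uniform(0.3, 1.0) for _ in range(k)]  # hypothetical s_x values
sy = [rng.uniform(0.2, 0.7) for _ in range(k)]  # hypothetical s_y values
r = [rng.uniform(0.1, 0.9) for _ in range(k)]   # positive r's
print(pooled_R(sx, sy, r), weighted_r(sx, sy, r))
```

When the s_x and s_y are exactly proportional the two quantities
coincide, which is the excluded case of the lemma.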

This is seemingly a contradictory result, since all
experimental evidence tends to show that $\bar{r}$ is smaller than R.
The explanation is found in the fact that we have here, not $\bar{r}$,
but r'. Formula (15) says that when the r's for the sub-groups
are weighted by the products of the corresponding standard
deviations, the mean value is larger than the correlation coefficient
for the entire population. But $s_x$ and $s_y$ are known to be
correlated, and each in turn is correlated with r. Therefore whenever
r is large, the weight given to it tends to be large also. This gives
undue prominence to the higher values of r, and produces the
apparent inconsistency.

In the experiment already discussed, the correlation
between $s_1$ and $s_2$ for the 99 pairs turned out to be 0.266. I did
not compute the correlation between these and r, but since $\bar{r}$ is
0.555, and R is 0.574, there is no doubt a significant correlation.

I made an effort to obtain a reduction for the terms
$s_x s_y r$ which would permit me to express exactly the relation
between $\bar{r}$ (unweighted) and R, but so far it has been without
success. The inter-correlation between the three quantities in the
product offers a considerable algebraic barrier.